Evaluating the Quality of Web-Mined Bilingual Sentence Pairs

Authors

  • Xiaohua Liu
  • Ming Zhou
Abstract

We address the problem of evaluating the quality of bilingual sentence pairs mined from the web, which is critical for a wide range of applications such as statistical machine translation (SMT) and English as a Second Language (ESL) learning. To address this problem, we propose a novel method that integrates multiple linguistic features related to spelling, grammar, and alignment, and in particular a sentence type feature that indicates whether a sentence can be parsed by the Link Grammar Parser (LGP). Promising results are achieved on a bilingual corpus of about 6 million English-Chinese sentences mined from the web, indicating the effectiveness of the proposed method.
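
As a rough illustration of what integrating several features into one quality judgment can look like, the sketch below combines four per-pair signals with a weighted sum and a cutoff. It is not the authors' method: the abstract does not say how the features are computed or combined, so the `PairFeatures` fields, the weights, the threshold, and the function names are all hypothetical, and the parsability flag merely stands in for the LGP-based sentence type feature.

```python
from dataclasses import dataclass


@dataclass
class PairFeatures:
    """Hypothetical per-pair feature values, each normalised to [0, 1]."""
    spelling: float    # e.g. share of correctly spelled English tokens
    grammar: float     # e.g. a grammar-check score
    alignment: float   # e.g. a word-alignment score between the two sides
    parsable: bool     # stand-in for the sentence type feature (parser accepted it)


# Assumed weights for illustration only; the source does not specify any.
WEIGHTS = {"spelling": 0.2, "grammar": 0.2, "alignment": 0.4, "parsable": 0.2}


def quality_score(f: PairFeatures) -> float:
    """Weighted sum of the features; a higher score suggests a cleaner pair."""
    return (
        WEIGHTS["spelling"] * f.spelling
        + WEIGHTS["grammar"] * f.grammar
        + WEIGHTS["alignment"] * f.alignment
        + WEIGHTS["parsable"] * (1.0 if f.parsable else 0.0)
    )


def keep_pair(f: PairFeatures, threshold: float = 0.7) -> bool:
    """Keep a mined pair only if its score reaches the (assumed) threshold."""
    return quality_score(f) >= threshold
```

In practice the weights would more plausibly be learned from labelled pairs than set by hand; the fixed values here only make the combination step concrete.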

Similar resources

Detecting Erroneous Sentences using Automatically Mined Sequential Patterns

This paper studies the problem of identifying erroneous/correct sentences. The problem has important applications, e.g., providing feedback for writers of English as a Second Language, controlling the quality of parallel bilingual sentences mined from the Web, and evaluating machine translation results. In this paper, we propose a new approach to detecting erroneous sentences by integrating pat...

An Efficient Framework to Extract Parallel Units from Comparable Data

Since the quality of statistical machine translation (SMT) is heavily dependent upon the size and quality of training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, the existing solutions are restricted to extracting either bilingual sentences or sub-sentential fragments. Instead, we present an efficient framework to extract both ...

Finding More Bilingual Webpages with High Credibility via Link Analysis

This paper presents an efficient approach to finding more bilingual webpage pairs with high credibility via link analysis, using little prior knowledge or heuristics. It extends a previous algorithm that takes the number of bilingual URL pairs that a key (i.e., a URL pairing pattern) can match as the objective function to search for the best set of keys yielding the greatest number of webp...

DOMCAT: A Bilingual Concordancer for Domain-Specific Computer Assisted Translation

In this paper, we propose a web-based bilingual concordancer, DOMCAT, for domain-specific computer assisted translation. Given a multi-word expression as a query, the system involves retrieving sentence pairs from a bilingual corpus, identifying translation equivalents of the query in the sentence pairs (translation spotting) and ranking the retrieved sentence pairs according to the relevanc...

Parallel Sentences Mining From The Web

Parallel sentences can benefit many NLP applications (e.g., machine translation and cross-language information retrieval). In this paper, candidate bilingual web pages are retrieved by submitting sentence pairs to a search engine and are then validated by surface patterns. We propose an algorithm for extracting candidate bilingual resources and filtering out useless bilingual web pages. The pair sentences includ...

Journal:
  • Int. J. of Asian Lang. Proc.

Volume 20  Issue 

Pages  -

Publication date 2010